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ABSTRACT 

Many astronomical phenomena exhibit patterns that have periodic behavior. An important step 
when analyzing data from such processes is the problem of identifying the period: estimating the 
period of a periodic function based on noisy observations made at irregularly spaced time points. 
This problem is still a difRcult challenge despite extensive study in different disciplines. This paper 
makes several contributions toward solving this problem. First, we present a nonparametric Bayesian 
model for period finding, based on Gaussian Processes (GP), that does not make assumptions on the 
shape of the periodic function. As our experiments demonstrate, the new model leads to significantly 
better results in period estimation especially when the lightcurve does not exhibit sinusoidal shape. 
Second, we develop a new algorithm for parameter optimization for GP which is useful when the 
likelihood function is very sensitive to the parameters with numerous local minima, as in the case of 
period estimation. The algorithm combines gradient optimization with grid search and incorporates 
several mechanisms to overcome the high computational complexity of GP. Third, we develop a novel 
approach for using domain knowledge, in the form of a probabilistic generative model, and incorporate 
it into the period estimation algorithm. Experimental results validate our approach showing significant 
improvement over existing methods. 
Subject headings: data analysis, variable stars 



1. INTRODUCTION 

Many astronomical phenomena exhibit periodic behav- 
ior. Discovering their period and the periodic pattern 
they exhibit is an important task toward understanding 
their behavior. A significant effort has been devoted to 
the analysis of lightcurves from periodic variable stars. 
For example, the top part of Figure ([T]) shows the mag- 
nitude of a light source over time. The periodicity of the 
light source is not obvious before we fold it. However, 
as the bottom part illustrates, once folded with the right 
period we get convincing evidence of periodicity. The ob- 
ject in this figure is classified as an eclipsing binary (EB). 
Other sources show periodic var iability due to processes 
internal to the star ( Petit|[l987 ). 

The problem of period estimation from noisy and ir- 
regularly sampled observations has been studied before 
in several disciplines. Most approaches identify the pe- 
riod by some form of grid search. That is, the problem 
is solved by evaluating a criterion $ at a set of trial pe- 
riods {p\ and selecting the period p that yields the best 
value for The commonly- used techniques vary in 

the form and paranietrization of $, the evaluation of the 
fit quality between model and data, the set of trial peri- 
ods searched, and the complexity of the resulting proce- 
dures. Two methods we use as base l ines in our stud y are 
the LS periodogram (Scargle 1982 Reimann |1994') and 
the p hase dispersion mmimization (PDMj ([Stellingwerf 
19781, both known for their success in empirical stud- 



ies. The LS method is relatively fast and is equivalent 
to maximum likelihood estimation under the assumption 
that the function has a sinusoidal shape. It therefore 
makes a strong assumption on the shape of the underly- 
ing function. On the other hand, PDM makes no such 
assumptions and is more generally applicable, but it is 
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Fig. 1. — Top: brightness of an eclipsing binary star over time; 
Bottom: brightness versus phase. 

slower and is less often used in practice. A more exten- 
sive discussion of related work is given in Section [5] 

The paper makes several contributions toward solving 
the period estimation problem. First, we present a new 
model for period finding, based on Gaussian Processes 
(GP), that does not make strong assumptions on the 
shape of the periodic function. In this context, the pe- 
riod is a hyperparameter of the covariance function of 
the GP and accordingly the period estimation is cast as 
a model selection problem for the GP. As our experi- 
ments demonstrate, the new model leads to significantly 
better results compared to LS when the target function is 
non-sinusoidal. The model also significantly outperforms 
PDM when the sample size is small. 
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Second, we develop a new algorithm for period estima- 
tion within the GP model. In the case of period esti- 
mation the likelihood function is not a smooth function 
of the period parameter. This results in a difficult es- 
timation problem whic h is not well exp lored in the GP 
literature (Rasmussen & Williams 20051. Our algorithm 



combines gradient optimization with grid search and in- 
corporates several mechanisms to improve the complex- 
ity over the naive approach. 

In particular we propose and evaluate: an approxima- 
tion using a two level grid search, approximation using 
limited cyclic optimization, a method using sub-sampling 
and averaging, and a method using low-rank Cholesky 
approximations. An extensive experimental evaluation 
using artificial data identifies the most useful approxi- 
mations and yields a robust algorithm for period finding. 

Third, we develop a novel approach for using astro- 
physics knowledge, in the form of a probabilistic genera- 
tive model, and incorporate it into the period estimation 
algorithm. In particular, we propose to employ the gen- 
erative model to bias the selection of periods by using it 
as a prior over periods or as a post-processing selection 
criterion choosing among periods ranked highly by the 
GP. The resulting algorithm is applied and evaluated on 
astrophysics data showing significantly improved perfor- 
mance over previous work. 

The next section provides some technical background 
and defines the period estimation problem as GP infer- 
ence. The following three sections present our algorithm, 
report on experiments evaluating it and applying it to as- 
trophysics data, and discuss related work. The final sec- 
tion concludes with a summary and directions for future 
work. 

2. PRELIMINARIES: GP FOR PERIOD FINDING 

This section provides technical background on GPs 
and their optimization procedures and defines the period 
finding problem in this context. 

Throughout the paper, scalars are denoted using ital- 
ics, as in x,y € IR; vectors and matrices use lowercase 
and capital bold typeface, as in a;,j/,K,A, and Xi de- 
notes the iih entry of x. For a vector x and real valued 
function / : IR — > IR, we extend the notation for / to 
vectors so that f{x) = [f{xi), • • • , f{xn)]'^ where the su- 
perscript T stands for transposition. I is the identity 
matrix. 

2.1. Gaussian Processes 

This section gives a brief review of Gaussian processes 
reg ression. A more extensive introduction can be found 
in lIRasmussen fc Williams||2005[ |Bishop||2006[ ) . 

We start with the following regression model. 



y = f-wix) + e 



(1) 



where fw (x) is the regression function with parameter w 
and e is iid Gaussian noise. For example, in linear regres- 
sion fw{x) = w^x and therefore y ^ N{w'^ x* ,1/ a"^). 
Given the data V — {xi,yi},i — 1, • • • , A^, one wishes 
to infer w and the basic approach is to maximize the 
likelihood C{w,V) =Vt{V\w). 

In Bayesian statistics, the parameter w is assumed to 
have a prior probability Pr(i(;) which encodes the prior 
belief on the parameter. The inference task becomes cal- 
culating the posterior distribution over which, using 



the Bayesian formula, is given as 

Pr(K;|X') cx Pr(X'|u;) Vi{w). 



(2) 



The predictive distribution for a new observation a;* is 
given by 



Pr(/(a?*)|X>) j VT{J{x*)\w)Vi{w\V)dw. (3) 

Returning to linear regression, the common model as- 
sumes that the prior for it; is a zero-mean multivariate 
Gaussian distribution, and the posterior turns out to be 
multivariate Gaussian as well. In contrast with many 
Bayesian formulations, the use of GP often allows for 
simple inference or calculation of desired quantities be- 
cause of properties of multivariate Gaussian distributions 
and corresponding facts from linear algebra. 

This approach can be made more general using a non- 
parametric Bayesian model. In this case we replace the 
parametric latent function by a stochastic process / 
where /'s prior is given by a Gaussian process. A GP is 
specified by a mean function (assumed to be zero in this 
paper) and covariance function /C(-, •). This allows us to 
specify a prior over functions / such that the distribu- 
tion induced by / over any finite sample is normally dis- 
tributed. More precisely, the GP regression model with 
zero mean and covariance function /C(-, •) is as follows. 
Given sample points \xi, . . . , £c„]-^ let K = {]C{xi, Xj))ij. 
The induced distribution on the values of the function at 
the sampling points is 



(4) 



where Af denotes the multivariate normal distribution. 
Now assuming that yi is generated from f{x.) using iid 
noise as in Equation ([T]) and denoting y — [j/i, . . . , ynY' 
we get that y ^ Af{0, K -I- ct^I) and the joint distribution 
is given by 



K 
K K 



K 



(5) 



Using properties of multivariate Gaussians we can see 
that the posterior distribution t\y is given by 

Pi{i\V)^J\f(K{(jH + K)-'^y,a^ ((jH + K)-'^K). (6) 

Similarly, the predictive distribution for some test 
point x^, distinct from the training examples is given by 

Pr{f{x,)\x,,V)^ J PT{f{x,)\x,,f)PTif\V)df 
^u(kix,f{ah + K)-^y, (7) 
IC{x^,x^) - k{x^f{crH + K)-^k{x^) 



where k(a;*) = [IC{xi,x^,), • • • , JC{xn,x^,)]'^ . 

Figure [2] illustrates GP regression, by showing how a 
finite sample induces a posterior over functions and their 
values for new sample points. 

2.2. Problem Definition 
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Fig. 2. — Illustration of prediction with GP regression. The data 
points D = {xi,yi} are given by the crosses. The shaded area 
represents the pointwise 95% confidence region of the predictive 
distribution. As can be seen from Equationl?] GP regression can 
be seen to perform a variant of kernel regression where /(a;») is 
a weighted average of all the measurements y. While the values 
of the weights are obscured because of the inverse of the covari- 
ance matrix in that expression, one can view this roughly by an 
analogy to nearest neighbor regression where the mean of f{x*) 
is affected more by the measurements whose sampling points arc 
close to X* and the variance of f{x*) is small if x* is surrounded 
by measi rrements. A deeper discussion of the equivalent kernel is 
given in | |Rasmussen fc Williams|2005[ |. 



Then, given and fj we sample the observations 

^M{f,{x=),aH). (10) 

Denote the complete set of parameters by = {9, a^}. 
For each time series j, the inference task is to select the 
correct model for the data {x^,y^}, that is, to find A4 
that best describes the data. This is the main computa- 
tional problem studied in this paper. The next subsec- 
tion reviews two standard approaches for this problem. 

Before presenting these we clarify two methodological 
issues. First, notice that our model assumes homoge- 
neous noise A/'(0,cr^), i.e. the observation error for each 
Xi is the same. Experimental results on the astronomy 
data (not shown here) show that estimated from the 
data is very close to the mean of the recorded observation 
errors, and therefore there is no advantage in explicitly 
modeling the recorded observation errors. 

Second, as defined above our task is to find the full 
set of parameters A4. Therefore, our framework and in- 
duced algorithms can estimate the underlying function, 
/, through the posterior mean /, and thus yield a so- 
lution for the regression problem - predicting the value 
of the function at unseen sample points. However, our 
main goal and interest in solving the problem is to infer 
the frequency w where the other parameters are less im- 
portant. Therefore, a large part of the evaluation in the 
paper focuses on accuracy in identifying the frequency, 
although we also report results on prediction accuracy 
for the regression problem. 



In the case of period estimation the sample points Xi 
are scalars Xi representing the corresponding time points, 
and we denote x = [xi, . . . ,x„]^. The underlying func- 
tion /(•) is periodic with unknown period p and corre- 
sponding frequency w — 1/p. To model the periodic 
aspect we use a GP with a periodic covariance function 



K.e{xi,Xj) = P exp 



2 sin^ {w'K{xi — Xj)) 



(8) 



where the set of hyperparameter^of the covariance func- 
tion is given hy 6 = {(3,w,f\. It can be easily seen that 
any / generated by Ke is periodic with period l/w. Fig- 
ure pi) illustrates the role of the other two hyperparam- 
eters. We can see that /? controls the magnitude of the 
sampled functions. At the same time, i which is called 
characteristic length determines how sharp the variation 
is between two points. 

In our problem each star has its own period and shape 
and therefore each has its own set of hyperparameters. 
Our model, thus, assumes that the following generative 
process is the one producing the data. For each time se- 
ries j with arbitrary sample points x'^ 
we first draw 



] IT 



(9) 



^ Typically, in a hierarchical model, the parameters of the top 
level (e.g. parameters of the prior) that affect the next level are 
called hyperparameters. In GP regression, the parameter is the 
regression function / and the hyperparameters are the the param- 
eters of covariance function. 



2.3. Model selection 

2.3.1. Marginal Likelihood 

The standard Bayesian approach is to identify the 
hyper-parameters that maximize the marginal likelihood. 
More precisely, we try to find Ai* such that 

M* = argmax [log [Pr(y|a;; A^)]] (11) 
M 

where the marginal likelihood is given by 
log Pr(y|a;; M) = log (^J Pr(y|/, x; M) Piif\x; M) df 



log\K + a^ 



log 27r 



(12) 



and Equation ( 12 ) holds becaus e y ^ A/'(0, K 
(T^I) (Rasmussen & Williams 2005). Typically, one can 



optimize the marginal likelihood by calculating the par 
tial derivative of the marginal likelihood w.r.t. the hyper- 
parameters and optimi zing the hyper-par ametcrs using 
gradient based search (Rasmussen & Williams 2005). As 
we show below, gradients alone cannot be used to solve 
our problem completely and therefore our algorithm elab- 
orates and improves over this approach. We do, however, 
use the conjugate gradients optimization as a basic step 



in our algorithm. The part ial derivative of Equation ( 12 
w.r.t. the parameter dj is ( [Rasmussen fc Williams|2U05 



86 



^ logPY{y\x;M)=Tr({aa'^ -K-')^-^" 



86. 



(13) 
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Fig. 3. — Sample functions from a GP with covariance function in Equation |8ll where the period is fixed to be 5, i.e. 
^ = 0.1 vs ^ = 10 while I is fixed to be 1. Bottom row: ^ = 0.3 vs ^ = 1 with /T^ 0.3. 



where = K 



2.3.2. Cross- Validation 

An alternative approach ("Rasmussen & Williams 2005 1 
picks hyperparameter Ai by minimizing the empirical 
loss on a hold out set. This is typically done with a 
leave-one-out (LOO) formulation, which uses a single ob- 
servation from the original sample as the validation data, 
and the remaining observations as the training data. The 
process is repeated such that each observation in the sam- 
ple is used once as the validation data. To be precise, we 
choose the hyperparameter M* such that 



M* = argmin 
M 



1=1 



(14) 



where /_i is defined as the posterior mean given the data 
{a;_i,y_i} in which the subscript —i means all but the 
ith sample, that is. 



y I 



(15) 



It can be shown that this com putation can be simplified 
( Rasmussen &: Williams|[2005 1 using the fact that 



y, - f-i{xi) 



[{K + aH)-^y]^ 



(16) 



: 0.2. Top row: 



denotes 



where is the ith entry of the vector and 
the (i, i)th entry of the matrix. 

3. ALGORITHM 

We start by demonstrating experimentally that gra- 
dient based methods are not sufficient for period esti- 
mation. We generate synthetic data and maximize the 
marginal likelihood w.r.t. 6 = {P,w,t\ using conjugate 
gradients. For this experiment, 30 samples in the interval 
[—10, 10] are generated according to the periodic covari- 
ance function in Equation jsl) with = [1,0.25,1]. Fix- 
ing /?, £ to their correct values, the marginal likelihood 
w.r.t. the period 1/w is shown in Figure |4] left. The 
figure shows that the marginal likelihood has numerous 
local minima in the high frequency (small period) region 
that have no relation to the true period. Figure [i] right 
shows two functions with the learned parametersbased 
on different starting points (initial values). 

The function plotted in dark color estimates the true 
function correctly while the one in light color does not. 
This is not surprising because from Figure [4] left, we can 
see that there is only a small region of initial points from 
which the algorithm can find the correct period. We re- 
peated this experiment using several other periodic func- 
tions with similar results. These preliminary experiments 
illustrate two points: 

• When other parameters are known, the marginal 
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Marginal likelihood w.r.t. period 




Period (log scale) 

Fig. 4. — Illustration of sensitivity of the marginal likelihood. A light curve is generated using the GP model with parameters /3 = 1, 
w = 0.25, and £ = 1. Left: The marginal likelihood function versus the period, where the dotted line indicates the true period. Right: 
The black circles are the observations and the dotted line (covered by the dark estimated curve) is the true function. The dark line which 
covers the true curve and the light line are the learned regression functions given two different starting points of w. 



Initialize the parameters randomly, 
repeat 

Jointly find tu, /3* , £* , cr* that maximize Equation l |12| using conjugate gradients, 
for all w in a coarse grid set C do 

Calculate the marginal likelihood Equation | |12[ l or the LOO Error Equation l |14| using /3*,£*,(t*. 
end for 

Set w to the best value found in the for loop, 
until Number of iterations reaches Li (Li = 2 by default) 

Record the Top K {K = 10 by default) frequencies W* found in the last run of for loop (lines 4-6). 
repeat 

Jointly find w,l3*,£*,a* that maximize Equation | |12[ l using conjugate gradients. 

for all ui in a fine grid set that covers W* do^ 

Calculate the marginal likelihood Equation | |12| or the LOO Error Equation ( |14[ l using f)*,£*,a*. 
end for 

Set w to the best value found in the for loop, 
until Number of iterations reaches L2 (I/2 = 2 by default) 

Output the frequency w* that maximizes the marginal likelihood or minimizes the LOO Error in the last run of for loop (lines 
11-13). 



Fig. 5. — Hyperparameter Optimization Algorithm 

likelihood function is maximized at the correct pe- 
riod, showing that in principle we can find the cor- 
rect period by optimizing the marginal likelihood. 

• On the other hand, it is not possible to identify the 
period using only gradient based search. 



before and grid search (in particular usin g two stages) is 



known to be the most effective solution (|Reimann||1994 
Hall et al.|[2000) ) 



Therefore , as in previous work (Reimann 1994 Hall 
et al. 2000), our algorithm uses grid search tor the fre- 
quency, 'i'he grid used for the search must be sufficiently 
fine to detect the correct frequency and this implies high 
computational complexity. We therefore follow a two 
level grid search for frequency where the coarse grid must 
intersect the smooth region of the true maximum and the 
fine grid can search for the maximum itself. The two-level 
search significantly reduces the computational cost. Our 
algorithm, presented in Figure [5] combines this with gra- 
dient based optimization of the other parameters. There 
are several points that deserve further discussion, as fol- 
lows: 

1. In step 3, we can successfully maximize the marginal 
likelihood w.r.t. and using the conjugate gradi- 
ents method, but this approach does not work for the 
frequency w. The reason is that the objective function is 
highly sensitive w.r.t. w and the gradient is not useful for 
finding the global maximum. This property justifies the 
structure of our algorithm. This issues has been observed 



7. Our algorithm uses cyclic optimization estimating 
w, (7, /?, £. That is to say, we fix other parameters a, /?, i 
and optimize w and then optimize tr, /3, £ when w is fixed. 
We keep doing this iteratively but use a small number 
of iterations (in our experiments, the default number of 
iterations is 2). A more complete algorithm would iterate 
until convergence but this incurs a large computational 
cost. Our experiments demonstrate that a small number 
of iterations is sufficient. 

3. In steps 3 and 11 we incorporate w into the joint 
optimization of the marginal likelihood. This yields bet- 
ter results than optimizing w.r.t. the other parameters 
with fixed w. This shows that the gradient of w some- 
times still provides useful information locally, although 
the obtained optimal value w is discarded. 

4. We use an adaptive search in the frequency domain, 
where at the first stage we use a coarse grid and later a 
fine grid search is performed at the neighbors of the best 
frequencies previously found. By doing this, the compu- 
tational cost is dramatically reduced while the accuracy 
of the algorithm is still guaranteed. 

Two additional approximations are introduced next, 
specifically targeting the coarse and fine grids respec- 
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tively and using observations that are appropriate in each 
case. 

3.1. Ensemble Subsampling 

The coarse grid search in hnes 4-6 of the algorithm 
needs to compute the covariance matrix w.r.t. each 
frequency in C and invert the corresponding covari- 
ance matrix, and therefore the total time complexity is 
O {\C\N^). In addition, different stars do not share the 
same sampling points. Therefore the covariance matrix 
and its inverse cannot be cached to be used on all stars. 
The computational cost is too high when the coarse grid 
has a large cardinality. Our observation here is that it 
might suffice to get an approximation of the likelihood at 
this stage of the algorithm, because additional fine grid 
search is done in the next stage. 

Therefore, to reduce the time complexity, we propose 
an ensemble approach that combines the marginal l ikeli- 
hood of several subs ampled times series. The idea (Pro- 
topapas et al. 2005) is that the correct period will get 
a high score tor ail sub-samples, but wrong periods that 
might score well on some sub-samples (and be preferred 
to others due to outliers) will not score well on all of 
them and will thus not be chosen. For the approx- 
imation, we sub-sample the original time series such 
that it only contains a fraction / of the original time 
points, repeating the process R times. The marginal 
likelihood score is the average over the R repetitions. 
Our experiments justify default settings of / = 15% 
(with the additional constraint that 30 < / < 40) and 
R = 10. This approximation reduces the time complex- 
ity to O (|C| X Rx (/7V)3). 

3.2. First Order Approximation with Low Rank 
Approximation 

Similar to the previous case, the time complexity of 
fine grid search is 0{\J^\N^). In this case we can reduce 
the constant factor in the 0{N^) term. Notice that in 
step 13, other parameters are fixed and the grid is fine so 
that the marginal likelihood is a smooth function of w. 
Suppose we have wq,Wi € J-' where is the fine grid and 
Aw ~ \wo — Wi\ < e, where e is a predefined threshold. 
Then, given K^^^, the covariance matrix w.r.t. wq, we 
can get K^-i by its Taylor expansion as 



dK 

dw 



(wo)Aw + o(e2). 



(17) 



Denote K ^ ^ 

gw 

small perturbation to . A t first look, th e Sherman- 



{wq) where KAw can be seen as a 



Morrison- Woodbury formula (Bishop 2006) appears to 
be suitable for calculating the update ot the inverse ef- 
ficiently. Unfortunately, preliminary experiments (not 
shown here) indicated that this method fails due to nu- 
meric instability. Instead, we use an update for the 
Cholesky factors of the matrix and calculate the in- 
verse through these. Namely, given the Cholesky de- 
composition of l^ujo — LL^ we calculate L such that 



AwK 



K^-^ . Details of this construc- 



tion are given in the appendix. 

3.3. Astrophysical Input Improvements 



For some cases we may have further information on the 
type of periodic functions one might expect. We propose 
to use such information to bias the selection of periods, 
by using it to induce a prior over periods or as a post- 
processing selection criterion. The details of these steps 
are provided in the next section. 

4. EXPERIMENTS 

This section evaluates the various algorithmic ideas us- 
ing synthetic and astrophysics data and then applies the 
algorithm to a different set of lightcurves. Our imple- 
men tation of the algorithms make s use of the gpml pack- 
age ( Rasmussen fc Nickisch||2010p ' 



4.1. Synthetic data 

In this section, we evaluate the performance of several 
variants of our algorithm, study the effects of its param- 
eters, and compare it to the two most used methods in 
the literature: the LS periodogram (LS) &mb 1976f 
and p hase dispersion minimization (PDM) ( [stellingwer: 
19781. 



in 
6l 



The LS method ( Lomb 1976 ) chooses w to maximize 
the periodogram defined as: 



Pls{^) = 



1 



2\ Ecos2(,7,) 



(18) 



where rjj — uj{xj — r). The phase r (that depends on w) 
is defined as the value satisfying tan(2cjT) 



COs{2LJXj ) ' 

As shown by (Reimann 1994), LS fits the data with a 
harmonic model using least-squares. 

In the PDM method, the period producing the least 
possible scatter in the derived light curve is chosen. The 
score for a proposed period can be calculated by folding 
the light curve using the proposed period, dividing the 
resulting observation phases into bins, and calculating 



the local variance within each bin, a = 



_ I2j ivj -v) 



N-l 



, where 



y is the mean value within the bin and the bin has TV 
samples. The total score is the sum of variances over all 
the bins. This method has no preference for a particular 
shape (e.g., sinusoidal) for the curve. 

We generate two types of artificial data, referred to as 
harmonic data and GP data below. For the first, data is 
sampled from a simple harmonic function, 

y ^ J\f (asin(i:jx + 0i) + 5cos(wa; -f- (?!)2),o-^l) (19) 

where a,b ^ Uniform(0, 5), uj ^ Uniform(l, 4), ^ 
Af{0,l) and the noise level is set to be 0.1. Note 
that this is the model assumed by LS. For the second, 
data is sampled from a GP with periodic covariance func- 
tion in Equation (|8]). We generate /3,^ uniformly in (0, 3] 
and (0, 3] respectively and the noise level is set to be 
0.1. The period is drawn from a uniform distribution 
between (0.5,2.5]. For each type we generate data un- 
der the following configuration. We randomly sampled 
50 time series each having 100 time samples in the in- 
terval [—5,5]. Then the comparison is performed using 
sub-samples with size increasing from 10 to 100. This 
is repeated ten times to generate means and standard 
deviations in the plots. 

^ http:/ /www. gaussianprocess.org/gpml/code/matlab/doc/ 
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Fig. 6. — Results for harmonic data (left column) and GP data (right column). Left: Accuracy (mean and standard deviation) versus the 
number of samples, where solid lines marked with nml represent GP with marginal likelihood where n denotes the number of iterations. The 
corresponding dotted lines marked nrss denote cross-validation results with n iterations. Middle: Reconstruction error for the regression 
function versus the number of samples. Right: Reconstruction curve of GP in two specific runs using maximum likelihood with different 
number of iterations. 
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The setting of the algorithms is as follows: In our 
algorithm we only use one stage grid search. For our 
algorithm and LS, the lowest frequency fmin to be ex- 
amined is the inverse of the span of the input data 
l/(2:max - Xjnin) = ^/T. The highest frequency /^ax is 
twice the Nyquist frequency /at, which we would obtain, 
if the data points were evenly spaced over the same span 
T, that is /at = N/{2T). We use an over-sample factor 
of 8, meaning that the range of frequencies is broken into 
even segments of l/ST. For PDM we set the frequency 
range to be [0.02, 5] with the frequency increments of 
0.001 and the number of bins in the folded period is set 
to be 15. 

For performance measures we consider both "accu- 
racy" in identifying the period and the error of the regres- 
sion function. For accuracy, we consider an algorithm to 
correctly find the period if its error is less than 1% of the 
true period, i.e., \'p — p\/p ^ 1%. Further experiments 
(not shown here) justify this approach by showing that 
the accuracies reported are not sensitive to the prede- 
fined error threshold. 

The results, where our algorithm docs not use the sam- 
pling and low rank approximations, are shown in Figure[6] 
and they support the following observations. 

1. As expected, the top left plot shows that LS per- 
forms very well on the harmonic data and it outperforms 
both PDM and our algorithm. This means that if we 
know that the expected shape is sinusoidal, then LS is the 
best choice. This confirms the conclusion of other stud- 
ies. For example, in the problem of detecting periodic 



TABLE 1 

Comparison of GPs: Original, Subsampling and 

SUBSAMPLING PLUS LOW RANK ChOLESKY UPDATE. ACC DENOTES 
ACCURACY AND S/TS DENOTES THE RUNNING TIME IN SECONDS 
PER TIME SERIES. 



genes from irregularly sampled gen e expressions (Wen 



tao et al. 2008 



Glynn et al. 2006), the periodic time 



series of interest were exactly sine curves. In this case 
studies showed that LS is the most effective comparing 
to several other statistical models. 

2. On the other hand, the top right plot shows that our 
algorithm is significantly better than LS on the GP data 
showing that when the curves are non-sinusoidal the new 
model is indeed useful. 

3. The two plots in top row together show that our al- 
gorithm performs significantly better than PDM on both 
types of data, especially when the number of samples is 
small. 

4. The first two rows show the performance of the 
cyclic optimization procedure with 1-5 iterations. We 
clearly see that for these datasets there is little improve- 
ment beyond two iterations. The bottom row shows 
two examples of the learned regression curves using our 
method with different number of iterations. Although 
one iteration does find the correct period, the reconstruc- 
tion curves are not accurate. However, here too, there 
is little improvement beyond two iterations. This shows 
that for the data tested here two iterations suffice for 
period estimation and for the regression problem. 

5. The performance of marginal likelihood and cross 
validation is close, with marginal likelihood dominating 
on the harmonic data and doing slightly worse in GP 
data. 

We next investigate the performance of the speedup 
techniques. For this we use GP data under the same con- 
figuration as the previous experiments. The experiment 
was repeated 10 times where in each round we generate 
100 lightcurves each having 100 samples but generated 
from different 9s. For the algorithm we used two itera- 





ORIGINAL 


SUBSAMPLING 


SUB + LOWR 


Acc 


0.831 ± 0.033 


0.857 ± 0.038 


0.849 ±0.028 


s/ts 


518.52 ± 121.49 


197.59 ± 14.10 


170.75 ± 17.93 



tions for cyclic optimization and varied the subsampling 
size, number of repetitions and rank of the approxima- 
tion. Table [l] shows results with our chosen parameter 
setting using sampling rate of 15%, 10 repetitions, ap- 
proximation rank M — [yj and grid search threshold 
e = 0.005. We can see that the subsampling technique 
saves over 60% percent of the run time while at the same 
time slightly increasing the accuracy. Low rank Cholesky 
approximation leads to an additional 15% decrease in run 
time, but gives slightly less good performance. Figure [7| 
plots the performance of the speedup methods under dii-^ 
ferent parameter settings. The figure clearly shows that 
the chosen setting provides a good tradeoff in terms of 
performance vs. run time. 

4.2. Astrophysics Data 

In this section, we estimate the periods of unfol ded as- 
trophysics tim e series from the OGLEII survey ( jSoszyn- 
ski et al |[2003l ). 

QUEET surveyed the sky over a number of years and has 
a huge number of light sources. The data we use here is a 
subset of OGLEII, containing a total of 14087 light curves 
of periodic variable stars that have previously been iden- 
tified to be periodic (and thus their period is known) and 
to be members of one of 3 types: Cepheids, RR Lyrae, 
and Eclipsing Binary (illustrated in Figure Isl) . 

We first explore, validate and develop our algorithm 
using a subset of OGLEII data and then apply the algo- 
rithm to the full OGLEII datgj^ except this development 
set. The OGLE subset is chosen to have 600 time series 
in total where each category is sampled according to its 
proportion in the full dataset. 



4.2.1. Evaluating the General GP Algorithm 

The setting for our algorithm is as follows: The grid 
search ranges are chosen to be appropriate for the ap- 
plication using coarse grid of [0.02, 5] in the frequency 
domain with the increments of 0.001. The fine grid is 
a 0.001 neighborhood of the top frequencies each having 
20 points with a step of 0.0001. We use K = 2Q top 
frequencies in step 9 of the algorithm and vary the num- 
ber of iterations in a cyclic optimization. When using 
sub-sampling, we use 15% of the original time series, but 
restrict sample size to be between 30 and 40 samples. 
This guarantees that we do not use too small a sam- 
ple and that complexity is not too high. For LS we use 
the same configuration as in the synthetic experiment. 
Results are shown in Table [2] and they mostly confirm 
our conclusions from the synthetic data. In particular, 
ML is slightly better than CV and subsampling yields a 
small improvement. In contrast with the artificial data, 
more iterations do provide a small improvement in per- 
formances and 5 iterations provide the best results in this 

http: / /www. cs. tufts. odu/rcsearch/ml/index.php?op=data_software 
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Fig. 7. — Accuracy (solid line) and Run time (dash-line) of approximation methods as a function of their parameters. Left; sub-sampling 
ratio (with R = 10). Middle: number of repetitions (with 15% sub-sampling). Right: rank in low rank approximation. 




Fig. 8. — Examples of light curves of periodic variable stars folded according to their period to highlight the periodic shape. 
Cepheid, middle: RR Lyrae, right: Eclipsing Binary. 



Left: 



TABLE 2 

Comparisons of different CPs on OGLEII subset, gp-ml 

AND GP-CV are GP with THE ML AND CV CRITERIA. SGP-ML 

and sgp-cv are the corresponding subsampling versions. 
The first column denotes the number of iterations. 





GP-ML 


GP-CV 


SGP-ML 


SGP-CV 


LS 


llTR ACC 


0.7856 


0.7769 


0.7874 


0.7808 


0.7333 


2lTR ACC 


0.7892 


0.7805 


0.7910 


0.7818 




3lTR ACC 


0.7928 


0.7806 


0.7964 


0.7845 




4lTR ACC 


0.7946 


0.7812 


0.7982 


0.7875 




5lTR ACC 


0.7964 


0.7823 


0.8000 


0.7906 





experiment. Finally, we can also see that all of the GP 
variants outperform LS. 

Although this is an improvement over existing algo- 
rithms ac curacy of 80% is still not satisfactory. As dis- 
cussed by Wachman ( 2009 ) , one particularly challenging 
task is finding the true period of EB stars. The difficulty 
comes from the following two aspects. First, for a sym- 
metric EB, the true period and half of the true period 
are not clearly distinguishable quantitatively. Secondly, 
methods that are better able to identify the true period 
of EBs are prone to find periods that are integer multi- 
ples of single bump stars like RRLs and Cepheids. On 
the other hand, methods that fold RRLs and Cepheids 
correctly often give "half" of the true period of EBs. In 
particular, the low performance of LS is due to the fact 
that it gives a half or otherwise wrong period for most 
EBs. 

To illustrate the results Figure |9] shows the periods 
found by our method and by GP on 4 stars. The top 
row shows 2 cases where the GP method finds the correct 
period and LS finds half the period. The bottom row 
shows cases where LS identifies the correct period and the 



GP does not. In the example on the left the GP doubles 
the period. In the example on the right the GP identifies 
a different period from LS but given the spread in the 
correct period the period it uncovers is not unreasonable. 

4.2.2. Incorporating Domain Knowledge 

We next show how this issue can be alleviated and 
the performance can be improved significantly using a 
learned probabilistic generative model. The methods de- 
veloped are general and can be applied whenever such 
a model is available. As illustrated in Figure |8j our 
astrophysics knowledge suggests that different types of 
stars have different typical shift-invariant "shapes" . In 
addition, each class has more than one such shape and 
each individual star has some variation from the com- 
mon shape. We use the Shift-invariant Grouped Mixed- 
effect Model (gmt) (Wang et al. 2010), which captures 
the common "shapes via a mixture of Gaussian pro- 
cesses while at the same time allowing for individual 
variations. This model was previously developed to cap- 
ture and aid in the classification of the astrophysics data. 
Once model parameters are learned we can calculate the 
likelihood of a light curve folded using a proposed pe- 
riod. Given the models, learned from a disjoint set of 
time series, for Cepheids, EBs and RRLs with parame- 
ter sets Mi, i — {C, E, R}, there are two perspectives on 
how they can be used: 

Model as Prior: The models can be used to induce an 
improper prior distribution (or alternatively a penalty 
function) on the period p. Given period p and sample 
points X the prior is given by 



Pr(p) 



max 

ie{C,E.R} 



iPTiy\x,p;M,)) 



(20) 



10 



Lighlcurve folded by GP (Correct) Lightcurve folded by LS (Incorrect) LIglitcurve folded Oy GP (Correct) Llghtcurve folded by LS (Incorrect) 




Phase Phase Phase Phase 



Fig. 9. — Examples of light curves where GP and LS and identify different periods and one of them is correct. Each pair shows the time 
series folded by GP on the left and LS on the right. The top row shows cases where LS identifies half the period. The bottom row shows 
cases where GP identifies double the period or a different period. 



TABLE 3 

Comparison of different regularization parameters on OGLEII subset using MAP. 
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.1 


.3 


.5 


.7 


.9 


1 


AGO 


0.87027 


0.85946 


0.81802 


0.81802 


0.80901 


0.80721 


0.8 



where from the perspective oi Aii, x and correspond- 
ing points in y are interpreted as if they were sampled 
modulo p. Thus, combining this prior with the marginal 
likelihood, a Maximum A Posteriori (MAP) estimation 
can be obtained. Adding a regularization parameter 7 
to obtain a tradeoff between the marginal likelihood and 
the improper prior we get our criterion: 



logFTip\x,y;M) 



J logFi {y\x,p;M) 
+ (l-7)logPr(p) 



(21) 



where Pr(y|a;,p; A^) is exactly as Equation (12) where 
the period portion of Ai is fixed to be p. When using 
this approach with o ur algorithm we use Equation (21 1 
instead of Equation ( 12 ) as the score function in lines 5 



and 13 of the algorithm. The results for different values 
of 7 (with subsampling and 5 iterations) are shown in 
Table [3] The results show that GMT on its own (7 — 0) 
is a good criterion for period finding. This is as one might 
expect because the OGLEII dataset includes only stars 
of the three types captured by GMT. 

In this experiment, regularized versions do not improve 
the result of the GMT model. However, we believe that 
this will be the method of choice in other cases when the 
prior information is less strong. In particular, if the data 
includes unknown shapes that are not covered by the 
generative model then the prior on its own will fail. On 
the other hand when using Equation ( |21[ ) with enough 
data the prior will be dominated by the likelihood term 
and therefore the correct period can be detected. In con- 
trast, the filter method discussed next does not have such 
functionality. 

Model as Filter: Our second approach uses the model as 
a post-processing filter and it is applicable to any method 
that scores different periods before picking the top scor- 



ing one as its estimate. For example, suppose we are 
given the top K best periods {Pi},i = 1, • • • ,K found 
by LS, then we choose the one such that 

p* — argmax I max [\ogPT{y\x,pi; Mj)] I . (22) 



\je{C,EM} 

Thus, when using the GMT as a filter, step 17 in our 
algorithm is changed to record the top K frequencies 
from the last for loop, evaluate each one using the GMT 
model likelihood, and output the top scoring frequency. 

Heuristic for Variable Periodic Stars: The two ap- 
proaches above are general and can be used in any prob- 
lem where a model is available. For the astrophysics 
problem we develop another heuristic that specifically 
addresses the half period problem of EEs. In particular, 
when using the filter method, instead of choosing the top 
K periods, we double the selected periods, evaluate both 
the original and doubled periods {pi, 2pi} using the GMT 
model, and choose the best one. 

Results of experiments using the filter method with 
and without the domain specific heuristic are given in 
Table l4j based on the 5 iteration version of subsampling 
GP. The filter method significantly improve the perfor- 
mance of our algorithm showing its general applicability. 
The domain specific heuristic provides an additional im- 
provement. For LS, the general filter method does not 
help but the domain specific heuristic significantly im- 
proves its performance. By analyzing the errors of both 
GP and LS, we found that their error regions are differ- 
ent. Therefore, we further propose a method that com- 
bines the two methods in the following way: pick the 
top K periods found by both methods and evaluate the 
original and doubled periods using the GMT to select the 
best one. As Table |4] shows, the combination gives an 
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TABLE 4 

Comparisons of different algorithms on OGLEII subset 
using the gmt as a filter. single denotes without the 
double period heuristic. 





ORIGINAL 


SINGLE FILTER 


FILTER, 


LS 


0.7333 


0.7243 


0.9053 


GP 


0.8000 


0.8829 


0.9081 


LS+GP 




0.8811 


0.9297 



additional 2% improvement on the OGLEII subset. 

4.2.3. Application 

Finally, we apply our method using marginal likelihood 
with two level grid search, sub-sampling at 15%, 2 itera- 
tions, and filtering on the complete OGLEII data set with 
13974 instances minus the development OGLEII subset. 
Note that the parameters of the algorithm, other than 
domain dependent heuristics, are chosen based on our 
results from the artificial data. The accuracy is reported 
using 10-fold cross validation under the following setting: 
the GMT is trained using the training set and we seek to 
find the periods for the stars in the test set. W e compare 
our results to the best result from ( WachmanF2009 ) that 



used an improvement of LS, despite the fact that they 
filtered out 1719 difficult stars due to insufficient sam- 
pling points and noise. The results are shown in Table [sj 
We can see that our approach significantly outperforms 
existing methods on OGLEII. 

5. RELATED WORK 

Period detection has been extensively studied in the lit- 
erature and especially in astrophysics. The periodogram, 
as a tool for spectral analysis, dates back to the 19th 
century when Schuster applied it to the analysis of some 
data sets. The behavior of t he periodo g ram in estimating 
frequency was discussed by Deeming (1975). The peri- 
odogram is defined as the modulus-s q uared o f its dis crete 
Fourier transfo rm (Deeming 19751. 
Scargle ( 1982 ) introduced the so-cal 



Lomb| ( |1976[ ) and 

led Lomb-Scargle 

(LS) Periodogram that was discussed above and which 
rates periods based on the sum-of-squares error of a 
sine wave at the give n period. This method has been 
used in astrophysics (Gumming 2004 Wachman 2009|) 
and has also been useoln Bioiniormatics (Glynn et al. 
2006| iWentao et al.||2008|). One can show that the LS 
do. 



periodogram is identical to the equation we would de- 
rive if we attempted to estimate the harmonic content 
of a data set at a specific freque ncy using the linear 
least-squares model (Scargle 1982). This technique was 



originally narn ed least-squares spectral analysis method 
Vanfcek| ( 1969 ). Many extensions of the LS periodograrn . 
exist m theliterature (B retthorst|2001[ ). [Hall fc Li| |2006 1 
proposed the periodogram tor non-parametric regression 
models and discussed its statistical properties. This 
was later applied to the situation where the regression 
model is the superposition of functions with different pe- 
riod (jHall 2008). 

The other main approach uses least-squares estimates, 
equivalent to maximum likelihood methods under Gaus- 
sian noise assumption, using different choices of periodic 
regression models. This approach, using finite-parameter 
trigonometric series o f different orders, has been explored 
by various authors ( Hartley"1949'; Quinn & Thomson 
1991] IQuinn fc Fernajrdes„1991, .Quinn 19991 IQuinn fc 



Hannan 2001). Notice that if the order of the trigono- 
metric series is high then t his is very close to nonpara- 
metric methods f Hall| |2008[ ) . 

Another intuition is to minimize some measure of dis- 
persion of the data in phase space. Phase Dispersion 
Minimization ( |Stellingwerf ^1978J , described above, per- 
forms a least squares ht t o the mean curve define d by 
averaging points in bins. Lafler fc Kinman ( 1965[ ) de- 
scribed a procedure which involves trial-period folding 
followed by a minimization of the differences between 
observations of adjacent phases. 

Other least squares methods use smoothing based 
on splines, robust s plines, or variable-span smoothers. 



Graven fc Wahba ( |1978[ ) discussed the problem of 
smoothing periodic curve with spline functions in the 
regularization framework and invented the generalized 
cross- Validati on (GCV) score to estimate the period of a 



variable star. Oh et al. (2002) extended it by substitut 
ing the smoothing splines with robust splines to alleviate 
the effects caused by outliers. Supersmoother, a variable- 
span smoother based on run ning linear smoo ths, is used 
for frequency estimation in ( McDonald||1986 ) . 

Several other approaches exist m Hie literature 
haps 



(Hall et al. 2000) 



Per- 
who 



m 

1 



the most related work is 
studied nonparametric models for frequency estimation 
including the Nadaraya- Watson estimator, and discussed 
their statistical properties. This was ext ended to perform 
inference for multi-period functions ('Hall & Yin 2003f 
and evolving periodic functions (Gcnton fc Hall 200 
Hall|[2008( ). Our work differs from (Hall et al 2000) m 
three aspects: 1) the GP framework presented in this pa- 
per is more general in that one can plug in different peri- 
odic covariance functions for different prior assumptions; 
2) we use marginal likelihood that can be interpreted 
to indicate how the data agrees with our prior belief; 3) 
we introduce mechanisms to overcome the computational 
complexity of period selection 



Other approac hes include entropy minimization (Hui- 



]se et al. 
transform 



2011 



data compe nsated discrete Fourier 

'erraz-M ellol |19 81[), and Bay esian models 

d Gregory & Lorcdo 1996; Scargle 1998). Recently, 
Bayesian methods have also been applied to solve the fre- 
quency estimati on problem, such as Bayes ian binning for 
Poisson -regime ( Gre gory fc Loredo 1996) and Bayesian 
blocks (|Scargle||l998|. 



6. CONCLUSION 

The paper introduces a nonparametric Bayesian ap- 
proach for period estimation based on Gaussian pro- 
cess regression. We develop a model selection algorithm 
for GP regression that combines gradient based search 
and grid search, and incorporates several algorithmic im- 
provements and approximations leading to a considerable 
decrease in run time. The algorithm performs signifi- 
cantly better than existing state of the art algorithms 
when the data is not sinusoidal. Further, we show how 
domain knowledge can be incorporated into our model 
as a prior or post-processing filter, and apply this idea 
in the astrophysics domain. Our algorithm delivers sig- 
nificantly higher accuracy than existing state of the art 
in estimating the periods of variable periodic stars. 

An important direction for future work is to extend 
our model to develop a corresponding statistical test 
for periodicity, that is, to determine whether a time se- 
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TABLE 5 

Comparisons of accuracies for full set of OGLEII. 





METHOD IN ( WACHMAn|2009|| 


ls-filter 


gp-filter 


gp-ls-filter 


ACC 


0.8680 


0.8975 ± 0.04 


0.8963 ± 0.03 


0.9243 ±0.03 



ries is periodic. This will streamline the application of 
our al gorithm to new astro physics catalogs such as MA- 
CHO (Alcock et al. 19931 where both periodicity test- 
ing and period estimation are needed. Another impor- 
tant direction is establishing the th eoretical properties 



It can be easily seen that K is a real symmetric ma- 
trix. Denote its eigendecomposition a.s K = UAU'^, 
then it can be written as the sum of a series of rank one 
components, 



of our method. Hall et al. (20001 provided the first- n 

order properties of nonparametric estimators such that — y^sgn(Ai) f -\/|Ai|ui) f \/|Ai|ui) 

under mild regularity conditions, the estimator is con- V / V / 



under mild regularity conditions, the estimator is con- 
sistent and asymptotically normally distributed. Our 
method differs in two ways: we use a GP regressor in- 
stead of Nadaraya- Watson estimator, and we choose the 
period that minimizes marginal likelihood rather than us- 
ing a cross-validation estimate. Based on the well known 
connection between kernel regression and GP regression, 
we conjecture that similar results exist for the proposed 
method. 
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APPENDIX 
A. LOW RANK APPROXIMATION 

In this appendix, we complete the details on how 
the first order approximation with low rank approxi- 
mation can be achieved by a series of rank one up- 
da tes/downdate s of the Cholesky factors. As shown 
by Seeger ( 2007[) each such update can be done in 0{N'^) 
using a series of Givens rotations. 



(Al) 



where is the zth eigenvalue and Uj is the correspond- 
ing eigenvector. Furthermore, we perform a low rank 

approximation to K such that 



K 



M 

E 

i=l 



sgn(A(,)) (^|A(i)|u(i)) (y^|A(,)|u(,)) (A2) 



where M < is a predefined rank and A(i) and uj^) 
are the ith largest (in absolute value) eigenvalue and its 
corresponding eigenvector. Therefore we have. 



i=l 



, (A3) 

where ii = We can see that the complex- 

ity for calculating the Cholesky factor of K^^ becomes 
0{MN'^). Therefore, we can choose an e-net £ of the 
fine grid such that \/w G T,swp^^£ \w — v\ < e, perform 
the exact Cholesky decomposition directly only on the e- 
net, and use the approximation on the other frequencies. 
In this way we reduce the complexity from 0{\J-'\N'^) to 
0{\£\N^ + \T\MN^). 
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